Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Posters

Posters at ISMB 2020 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform along with a PDF of their poster. All registered conference participants will have access to the poster and presentation through the conference and content until October 31, 2020. Q&A opportunities through a chat function allow interaction between presenters and participants.

Preliminary information on preparing your poster and poster talk is available at: https://www.iscb.org/ismb2020-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:


Poster Session A: July 13 and July 14, 7:45 am - 9:15 am Eastern Daylight Time
Poster Session B: July 15 and July 16, 7:45 am - 9:15 am Eastern Daylight Time
July 14 between 10:40 am - 2:00 pm EDT
A Bologna Biocomputing pipeline combining multiple tools for protein functional annotation in CAFA4
COSI: Function / CAFA 4
  • Giacomo Tartari, ELIXIR-IT, Italy
  • Giulia Babbi, Biocomputing Group Bologna, Italy
  • Pier Luigi Martelli, University of Bologna, Italy
  • Castrense Savojardo, University of Bologna, Italy
  • Davide Baldazzi, University of Bologna, Italy
  • Rita Casadio, University of Bologna, Italy

Short Abstract: Computational platforms for protein functional annotation are crucial in large-scale functional genomics projects. Here we present a novel pipeline for the prediction of protein function and its evaluation in the context of the latest edition of CAFA. The pipeline combines several platforms previously developed by our group and based on different annotation approaches, including transfer by sequence similarity, protein-protein interaction data (guilt-by-association) and pure machine-learning-based approaches. Specifically, the pipeline incorporates an updated version of the Bologna Annotation Resource v3.0 (BAR3, bar.biocomp.unibo.it/bar3/), which predicts protein function from sequence similarity-based annotation clusters. We also incorporate the possibility of deriving annotations from the protein interaction networks of NETGE-PLUS (net-ge2.biocomp.unibo.it), a recently developed tool for network-based functional annotation. NETGE-PLUS “functional modules”, derived from protein-protein interaction data (STRING) and associated with GO terms, annotate query protein sequences when they enter a module. The above methodologies were complemented by BUSCA (busca.biocomp.unibo.it), a server integrating different machine-learning tools for annotating subcellular localizations. Preliminary benchmarks showed that the combination strategy is effective for improving performance. Our data indicate that the inclusion of complementary approaches extended the prediction coverage of the 97,999 CAFA4 targets up to 84%, 82% and 100% for biological process, molecular function and cellular component terms, respectively.

A Single-Cell Proteomic Snapshot of Early T Cell Development in Human Infant Thymus
COSI: Function / CAFA 4
  • Yue Wang, Bioinformatics Division, BNRIST and Department of Automation, Tsinghua University, China
  • Fanhong Li, Bioinformatics Division, BNRIST and Department of Automation, Tsinghua University, China
  • Jian Gu, Research Unit of Liver Transplantation and Transplant Immunology, Chinese Academy of Medical Sciences, China
  • Ling Lu, Research Unit of Liver Transplantation and Transplant Immunology, Chinese Academy of Medical Sciences, China
  • Xuegong Zhang, Tsinghua University, China

Short Abstract: T cells play important roles in human adaptive immunity. Studying protein expression patterns during early T cell development is crucial for understanding their differentiation and functions. All human T cells develop in the thymus before adolescence, and all major subtypes of T cells appear in the infant period. We collected 5 thymic samples from human infants aged 1 month to 1 year to explore the characteristics and functions of multiple proteins. We applied single-cell mass cytometry to measure the expression of selected proteins, and inferred a T cell developmental tree using self-organizing map clustering and the minimum spanning tree method. All major cell subtypes during development were captured. We observed that proteins associated with the immature period of T cells are enriched in functions of cell activation and proliferation, contributing to cell population enlargement and further differentiation. We also discovered protein expression patterns driving the development of unconventional T cells, including CD8+ tissue-resident memory T cells, Th17 cells and Tregs. Phosphorylated transcription factors in several signaling pathways were found to be highly expressed in different stages. The derived thymic T cell developmental tree and protein expression patterns provide a first proteomic snapshot of early T cell development at single-cell level in human infants.

Automatic function prediction in the 2020’s
COSI: Function / CAFA 4
  • Stavros Makrodimitris, Delft University of Technology, Netherlands
  • Roeland van Ham, Delft University of Technology, Netherlands
  • Marcel Reinders, Delft University of Technology, Netherlands

Short Abstract: This poster aims at initiating discussions among function prediction researchers about important challenges that the field will face in the near future. Specifically, we would like to raise the following issues:
1) Is the Gene Ontology itself hindering the progress of function prediction algorithms? And how could it be modified to solve this?
2) Biological Process functions can be cell-type specific and/or condition-specific, but are most often lumped together into a general function description. However, new developments, such as single-cell sequencing, have the potential to deconvolute this information. This creates new challenges and opportunities both for algorithm developers as well as curators, to represent and exploit this more detailed knowledge.
3) The Critical Assessment of Functional Annotation (CAFA) has shown that ensemble methods that use multiple data sources, such as gene expression and protein interactions, are more effective than sequence-only methods. Unfortunately, these data are not available for non-model species, a fact that is hidden because CAFA mainly has to focus on model species. Can this limitation be overcome by computational means (e.g. predicting co-expression or having more complex sequence models) or do we need to generate experimental data for each new species we sequence?

CaoLab2: Protein Function Prediction using Hidden Markov Models
COSI: Function / CAFA 4
  • Yaroslav Kravchuk, Pacific Lutheran University, United States
  • San Nge, Pacific Lutheran University, United States
  • Renzhi Cao, Pacific Lutheran University, United States
  • Kyle Hippe, Pacific Lutheran University, United States
  • Sola Gbenro, Pacific Lutheran University, United States

Short Abstract: As the body of genomic product data increases at an ever faster rate, computational analysis of protein function has never been more important. Here we introduce the CaoLab2 server, which took part in the CAFA4 experiment. Our method uses Hidden Markov Models (HMMs), a prominent natural language processing technique, and generates a unique HMM for each Gene Ontology (GO) function term. Variability in the number of sequences associated with each GO term caused an imbalance in the representation of GO terms. To resolve this, our method employs data augmentation to artificially inflate the number of protein sequences associated with GO terms that have a limited amount of data, in order to balance the representation of these terms in our dataset. For GO terms that have fewer than 100 associated sequences, a Hidden Markov Model trained on the available data was used to generate new sequences, giving a minimum of 100 sequences for each term. Predictions are made by running the query sequence against each model; the top eighty percent of models by score, or the 200 models with the highest scores, whichever is fewer, represent the functions most associated with the given sequence.
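
The selection rule at the end of this abstract can be illustrated with a minimal sketch. The function name, the dictionary layout and the toy scores below are hypothetical and are not taken from the CaoLab2 server; the sketch only shows one reading of "top eighty percent or 200 models, whichever is less".

```python
def select_top_models(scores, percent=80, cap=200):
    """Return the GO terms whose models scored highest for one query.

    scores: dict mapping a GO term to the score of its model (assumed layout).
    Keeps the top `percent` of models or `cap` models, whichever is fewer.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    keep = min(len(ranked) * percent // 100, cap)
    return [term for term, _ in ranked[:keep]]


# Hypothetical scores from five per-term models for a single query sequence
example = {"GO:A": 12.5, "GO:B": 3.1, "GO:C": 9.8, "GO:D": -1.0, "GO:E": 7.2}
selected = select_top_models(example)
print(selected)
```

With only five models the 80% limit applies; with more than 250 models the cap of 200 becomes the binding limit.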

CaoLab: Protein function and disorder prediction from sequence based on RNN
COSI: Function / CAFA 4
  • Renzhi Cao, Pacific Lutheran University, United States
  • Kyle Hippe, Pacific Lutheran University, United States
  • Sola Gbenro, Pacific Lutheran University, United States

Short Abstract: Much progress has been made in machine learning and natural language processing. Here we introduce the CaoLab server, which took part in the latest CAFA4 experiment. We used natural language processing and machine learning techniques to tackle the protein function prediction and disorder prediction problems. ProLanGO2 predicts protein function from protein sequence, and ProLanDO predicts protein disorder from protein sequence. The latest version of the UniProt database (as of 12/12/2019) is used to extract the top 2000 most frequent k-mers (k from 3 to 7) to build a fragment sequence database (FSD). The ProLanDO method uses the DO database provided by CAFA4 (www.disprot.org/); each sequence is filtered with FSD, and a character-level RNN model is trained to classify the DO term. The ProLanGO2 method is an updated version of ProLanGO, published in 2017, and uses the latest version of the UniProt database filtered by FSD. An Encoder-Decoder network, a model consisting of two RNNs (encoder and decoder), is trained on the dataset, and the top 100 best performing models are used to select ensemble models as the final model for protein function prediction.
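
As an illustration of the k-mer extraction step mentioned in this abstract, the following sketch counts all k-mers with k from 3 to 7 and keeps the most frequent ones. The function name and the toy sequences are hypothetical; how the authors count, break ties, or filter sequences with the FSD is not described in the abstract and is not modeled here.

```python
from collections import Counter


def top_kmers(sequences, k_min=3, k_max=7, top_n=2000):
    """Return the top_n most frequent k-mers (k_min <= k <= k_max)."""
    counts = Counter()
    for seq in sequences:
        for k in range(k_min, k_max + 1):
            # Slide a window of width k over the sequence
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
    return [kmer for kmer, _ in counts.most_common(top_n)]


# Toy input; a real run would iterate over UniProt sequences
toy_sequences = ["MKVLAAGMKVL", "GMKVLA"]
fsd = top_kmers(toy_sequences, top_n=5)
print(fsd)
```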

Comparison of summarization methods of co-expressed gene list to predict gene function.
COSI: Function / CAFA 4
  • Yuichi Aoki, Tohoku University, Japan
  • Kengo Kinoshita, Tohoku University, Sendai, Japan
  • Takeshi Obayashi, Tohoku University, Sendai, Japan

Short Abstract: Gene co-expression, the similarity of gene expression profiles, provides fundamental information for investigating gene function. A gene co-expression relationship is a continuous value with no clear threshold separating co-expressed gene pairs from the others, and it is therefore often provided as a very long gene list for a guide gene of interest. In the gene co-expression databases COXPRESdb, ATTED-II and ALCOdb, which we have developed for animals, plants, and microalgae, respectively, the top 50 co-expressed genes for a guide gene are shown by default, but co-expressed genes beyond the top 50 also carry some information about the guide gene. Gene Set Enrichment Analysis (GSEA) is a popular way to summarize annotations of a long gene list, and is adopted in several databases including STRING-DB and our co-expression databases. However, gene co-expression is a correlation relationship, and therefore gene co-expression strengths are highly correlated even under random gene expression data. This property of gene co-expression distorts the results of enrichment tests, including GSEA. Here, we compare summarization methods for co-expressed gene lists, with a special focus on threshold-based methods (K-NN and Fisher's exact test) and threshold-free methods (GSEA and the Wilcoxon test).
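
To make the threshold-based idea concrete, here is a minimal sketch of a one-sided Fisher's exact test applied to the top-K genes of a ranked co-expression list. It is an illustration only, not the authors' implementation; the function name, the gene identifiers and the choice of a one-sided test are assumptions.

```python
from math import comb


def fisher_enrichment_p(ranked_genes, annotated, top_k):
    """One-sided Fisher's exact test p-value for annotation enrichment.

    ranked_genes: unique gene IDs ordered by co-expression with the guide gene.
    annotated: set of gene IDs carrying the annotation of interest.
    The p-value is the hypergeometric probability of observing at least as
    many annotated genes among the top_k as were actually seen.
    """
    n_total = len(ranked_genes)
    k = min(top_k, n_total)
    n_annot = len(annotated & set(ranked_genes))
    hits = len(set(ranked_genes[:k]) & annotated)
    tail = sum(
        comb(n_annot, x) * comb(n_total - n_annot, k - x)
        for x in range(hits, min(k, n_annot) + 1)
    )
    return tail / comb(n_total, k)


# Hypothetical ranked list of ten genes, three of them annotated
genes = [f"g{i}" for i in range(1, 11)]
p_value = fisher_enrichment_p(genes, {"g1", "g2", "g9"}, top_k=3)
print(p_value)
```

The result depends directly on `top_k`, which is the threshold sensitivity that motivates comparing against threshold-free methods.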

Convolutional Neural Network Architectures for CAFA4
COSI: Function / CAFA 4
  • Jari Björne, University of Turku, Finland

Short Abstract: In this work, different convolutional neural architectures were examined for the task of protein function prediction, targeting the GO, HPO and DO annotations. Uniprot and CAFA4 target protein sequences used as training data were enriched with additional information from protein taxonomical hierarchies and InterProScan sequence analyses.

Both parallel and nested convolutional networks were tested, as well as a one-dimensional variant of the MobileNet V2 image recognition network and a network based on the DeepGOPlus architecture. All networks received as inputs sequences consisting of amino acid embeddings and InterProScan domain, family and homologous superfamily embeddings. The convolutionally processed output from this data was merged together with embeddings representing levels of the protein's organism's taxonomy lineage. Multi-label predictions were generated for up to 1000 of the most common ontology terms.

The dataset was processed with six-fold cross-validation, with the test set predictions finally merged together. Experiments were performed using either all known annotations as labels, or using only the non-IEA (inferred from electronic annotation) ones, and with either all proteins or with only the ones having at least one annotated term. System performance was measured using the F1 and AUC metrics, with the best performing systems' predictions submitted for the CAFA4 challenge.

Cross-species functional prediction by global network alignment
COSI: Function / CAFA 4
  • Wayne Hayes, UCI, United States

Short Abstract: We report the first successful cross-species prediction of protein function based solely on topology-driven global network alignment. Using SANA (the Simulated Annealing Network Aligner), we pair the proteins of one species with those of another solely by maximizing the number of aligned edges between the networks. We find that SANA’s confidence, called NAF, in each individual pair of aligned proteins correlates with their functional similarity. We then apply SANA to BioGRID 3.0 networks from April 2010, and use GO data from the same month to transfer GO annotations from better-annotated proteins to lesser-annotated ones. We validate the predictions on a recent GO release and find an AUPR of up to 0.4 depending on the predicting GO evidence code, even when restricting predictions to proteins that have no observed sequence or homology relationship. Finally, we apply the same method to recent BioGRID PPI networks of mouse and human, and predict novel cilia-related GO terms in human proteins based on their confident alignment with cilia-annotated mouse proteins; the most confident predictions have literature validation rates above 80%. We propose topology-based alignment of PPI networks as a novel source for prediction of protein function that is independent of sequence or structural information.

DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier
COSI: Function / CAFA 4
  • Maxat Kulmanov, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Short Abstract: Predicting the phenotypes resulting from molecular perturbations is one of the key challenges in genetics. Both forward and reverse genetic screens are employed to identify the molecular mechanisms underlying phenotypes and disease, and these have resulted in a large number of genotype-phenotype associations being available for humans and model organisms. Combined with recent advances in machine learning, it may now be possible to predict human phenotypes resulting from particular molecular aberrations.
We developed DeepPheno, a neural network based hierarchical multi-class multi-label classification method for predicting the phenotypes resulting from complete loss-of-function in single genes. DeepPheno uses the functional annotations of gene products to predict the phenotypes resulting from a loss-of-function; additionally, we employ a two-step procedure in which we predict these functions first and then predict phenotypes. Prediction of phenotypes is ontology-based and we propose a novel ontology-based classifier suitable for very large hierarchical classification tasks. These methods allow us to predict phenotypes associated with any known protein-coding gene. We evaluate our approach using evaluation metrics established by the CAFA challenge and compare with top performing CAFA2 methods as well as several state of the art phenotype prediction approaches, demonstrating the improvement of DeepPheno over state of the art methods.

Design of new binary classification machine learning features for automated protein function annotation
COSI: Function / CAFA 4
  • Vladimir Perović, Laboratory for Bioinformatics and Computational Chemistry, Vinča Institute of Nuclear Sciences, University of Belgrade, Serbia
  • Radoslav Davidović, Laboratory for Bioinformatics and Computational Chemistry, Vinča Institute of Nuclear Sciences, University of Belgrade, Serbia
  • Nevena Veljković, Laboratory for Bioinformatics and Computational Chemistry, Vinča Institute of Nuclear Sciences, University of Belgrade, Serbia
  • Branislava Gemović, Laboratory for Bioinformatics and Computational Chemistry, Vinča Institute of Nuclear Sciences, University of Belgrade, Serbia

Short Abstract: The vast number of new protein sequences calls for new and efficient methods for functional annotation. The best performing computational methods are machine learning (ML) algorithms based on the integration of attributes from sequence, structure, expression profile, genomic context, and molecular interactions.
Our study aimed at adapting features for a binary classification method in ML for protein-term pairs, the central instances in automated protein function annotation. Besides the groups of basic attributes that are used in our model: evolutionary – PSSM matrices; sequence – amino acid composition and distribution of amino acids (dyads and triads); and graph – network metrics from Gene Ontology, we designed two new features: 1) BLAST – utilizing predictions based on the Basic Local Alignment Search Tool (BLAST), and 2) Naïve – utilizing predictions based on the Naïve method, representing frequencies of terms in GO.
We compared models generated on these data by three ML algorithms: Generalized Linear Model, Gradient Boosting Machine and Random Forest. Datasets for creation and validation of models were obtained from CAFA Challenges. Using the new BLAST and Naïve features in automated protein function annotation, in addition to the set of basic attributes, increases the performance of prediction models by 15%.
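
For readers unfamiliar with the frequency-based Naïve feature, a minimal sketch follows. It assumes the common formulation in which each GO term is scored by the fraction of annotated training proteins that carry it; the function name and toy annotations are hypothetical, and the authors' exact feature construction may differ.

```python
from collections import Counter


def naive_scores(annotations):
    """Score each GO term by its relative frequency among annotated proteins.

    annotations: dict mapping a protein ID to its set of GO terms.
    The same scores are assigned to every query protein, which is why the
    approach serves as a baseline or, as here, as an input feature.
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    n_proteins = len(annotations)
    return {term: count / n_proteins for term, count in counts.items()}


# Hypothetical training annotations
training = {
    "P1": {"GO:1", "GO:2"},
    "P2": {"GO:1"},
    "P3": {"GO:1", "GO:3"},
    "P4": {"GO:2"},
}
scores = naive_scores(training)
print(scores)
```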

Embeddings allow GO annotation transfer beyond homology
COSI: Function / CAFA 4
  • Maria Littmann, Department of Informatics, Technical University of Munich, Germany
  • Michael Heinzinger, (TUM) Technical University of Munich, Germany
  • Burkhard Rost, Rostlab, Germany

Short Abstract: Understanding protein function is crucial for molecular and medical biology; nevertheless, Gene Ontology (GO) annotations have been manually confirmed for fewer than 0.5% of all known protein sequences. Computational methods bridge this sequence-function gap, but the best prediction methods need evolutionary information to predict function. Here, we propose a new method that predicts GO terms through annotation transfer without using sequence similarity. Instead, the method uses SeqVec embeddings to transfer annotations between proteins through proximity in embedding space. SeqVec’s data-driven feature extraction transferred knowledge from large unlabeled databases to smaller but labelled datasets (transfer learning). Replicating the conditions of CAFA3, our method reached an Fmax of 50%, 59%, and 65% for BPO, MFO, and CCO, respectively. This was numerically higher than all methods that had actually participated in CAFA3 for BPO and CCO and scored second for MFO. When the lookup dataset was restricted to proteins with less than 20% pairwise sequence identity to the targets, performance dropped clearly (Fmax BPO 38%, MFO 46%, CCO 56%), but continued to clearly outperform simple homology-based inference. Thereby, the new method may help in annotating novel proteins not belonging to large families.
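
Annotation transfer by proximity in embedding space can be sketched as a nearest-neighbor lookup. The example below uses toy two-dimensional vectors in place of real SeqVec embeddings, and the Euclidean distance, the single-neighbor rule and all names are assumptions made for illustration; the abstract does not specify these details.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotations(query_embedding, lookup):
    """Copy the GO terms of the closest lookup protein to the query.

    lookup: list of (embedding, set_of_go_terms) pairs for annotated proteins.
    Returns the transferred terms and the distance to the chosen neighbor.
    """
    best_embedding, best_terms = min(
        lookup, key=lambda item: euclidean(query_embedding, item[0])
    )
    return set(best_terms), euclidean(query_embedding, best_embedding)


# Toy lookup set standing in for embeddings of annotated proteins
lookup_set = [
    ([0.0, 0.0], {"GO:1"}),
    ([1.0, 1.0], {"GO:2", "GO:3"}),
]
terms, distance = transfer_annotations([0.9, 1.2], lookup_set)
print(terms, distance)
```

The returned distance could be turned into a reliability score, since closer neighbors plausibly give more trustworthy transfers.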

Enzymes, Moonlighting Enzymes, Pseudoenzymes: Similar in Sequence, Different in Function
COSI: Function / CAFA 4
  • Constance Jeffery, University of Illinois at Chicago, United States

Short Abstract: The function of a newly sequenced protein is often estimated by sequence alignment with the sequences of proteins with known functions. However, members of a protein superfamily can share significant amino acid sequence identity but vary in the reaction catalyzed and/or the substrate used. In addition, a protein superfamily can include moonlighting proteins, which have two or more functions, and pseudoenzymes, which have a three-dimensional fold that resembles a conventional catalytically active enzyme, but do not have catalytic activity. I will discuss several examples of protein families that contain enzymes with noncanonical catalytic functions, pseudoenzymes, and/or moonlighting proteins. Pseudoenzymes and moonlighting proteins are widespread in the evolutionary tree and are found in many protein families, and they are often very similar in sequence and structure to their monofunctional and catalytically active counterparts. A greater understanding is needed to clarify when similarities and differences in amino acid sequences and structures correspond to similarities and differences in biochemical functions and cellular roles. This information can help improve programs that identify protein functions from sequence or structure and assist in more accurate annotation of sequence and structural databases, as well as in our understanding of the broad diversity of protein functions.

Fine-tuning of Language Model-Based Representation for Protein Functional Annotation
COSI: Function / CAFA 4
  • Ehsaneddin Asgari, Helmholtz Center for Infection Research, Germany
  • Andrew Dickson, University of California, Berkeley, United States
  • Meisam Ahmadi, Iran University of Science and Technology (IUST), Iran
  • Mohammad Khodabakhsh, University of Michigan, United States
  • Mohammad R.K. Mofrad, University of California, Berkeley, United States
  • Alice McHardy, Helmholtz Centre for Infection Research, Germany

Short Abstract: We present our approach to fine-tuning language model-based representations for the prediction of (i) gene-ontology (GO), (ii) human-phenotype-ontology (HPO), and (iii) disorder-ontology (DO) terms, as subtasks of CAFA 4. Recently, transfer learning has shown significant improvements in many machine learning problems, in particular when annotated data are scarce. Being both self-supervised and general enough makes neural language modeling an ideal candidate for transfer learning on sequential data. Subsequently, the trained language modeling network can be fine-tuned for any particular task, even when only a limited number of annotations are available. In the CAFA 4 subtasks, we make use of language model-based transfer learning in the following steps: (i) we train a language model-based representation of protein sequences on a large collection of protein sequences (UniRef50); (ii) we fine-tune the obtained model for the supervised task of neural GO prediction, which has relatively more training instances than HPO and DO; (iii) we fine-tune the model already tuned for GO prediction a second time, this time for the prediction of HPO and DO. To improve the predictions, we use an ensemble of different fine-tuning paths from language modeling to the supervised annotation prediction of interest.

Integration of large-scale expression profiles into a multi-method protein function pipeline
COSI: Function / CAFA 4
  • Yiheng Zhu, 1. Nanjing University of Science and Technology 2. University of Michigan, China
  • Chengxin Zhang, University of Michigan, United States
  • Rucheng Diao, University of Michigan, United States
  • Xiaogen Zhou, University of Michigan, United States
  • Yang Zhang, University of Michigan, United States
  • Peter Freddolino, University of Michigan, United States

Short Abstract: The Critical Assessment of protein Function Annotation algorithms (CAFA) offers an opportunity for the community to explore and consistently benchmark protein function prediction on a larger scale. Previously, in CAFA 3.14 (PI), the expression-based baseline methods showed the potential of expression data to complement sequence-based information.

We previously developed MetaGO to provide protein functional annotations using an ensemble model with sequence homology, local alignments of predicted structures, and protein-protein interaction information. We have supplemented MetaGO with co-expression information by curating expression data and measuring gene-gene co-expression with rank-based discretized mutual information and mapping to proteins.

Benchmarks of the contribution of co-expression scores to MetaGO on a dataset of 1224 E. coli proteins showed improved performance for the Biological Process and Molecular Function GO aspects. The supplemented expression information enhanced overall MetaGO performance in our benchmark sets, and was used with curated expression data to contribute to function predictions for nine species in the CAFA4 target set. Ongoing enhancements to MetaGO include development of a generalized expression data query pipeline for transparent incorporation of expression data into any functional query, and integration of the expression-based scores as a feature in a deep-learning model for protein function prediction.
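
The rank-based discretized mutual information used for co-expression in this abstract can be sketched roughly as follows. The equal-frequency binning, the number of bins, the base-2 logarithm and all names are assumptions; the abstract does not give these details, so this is an illustration of the general technique rather than the MetaGO procedure.

```python
import math
from collections import Counter


def rank_bins(values, n_bins):
    """Replace each value by the equal-frequency bin of its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y, n_bins=3):
    """Mutual information (in bits) between two rank-discretized profiles."""
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


# Hypothetical expression profiles of two genes across six samples
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
mi = mutual_information(gene_a, gene_b)
print(mi)
```

Because only ranks enter the calculation, any monotonic rescaling of a profile leaves the score unchanged.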

Learning sequence, structure and network features for protein function prediction
COSI: Function / CAFA 4
  • Meet Barot, Center for Data Science, New York University, United States
  • Daniel Berenberg, Flatiron Institute, Simons Foundation, United States
  • James Morton, Flatiron Institute, Simons Foundation, United States
  • Vladimir Gligorijevic, Flatiron Institute, Simons Foundation, United States
  • Richard Bonneau, New York University, United States

Short Abstract: Recently, self-supervised and unsupervised representation learning approaches have shown tremendous potential in exploring the huge volume of protein data in many databases and learning features indicative of protein function. Building upon our previous works on sequence, structure, and network data, we propose to systematically investigate the contribution of these features on classification performance on individual GO terms, study their complementarity, and build an integrative model.

We compute: sequence-based features using our LSTM language model pre-trained on ~10 million protein sequences; structure-based features from contact maps using a Graph Autoencoder pretrained on ~30k domain structures from CATH and network-based features using our deepNF model pre-trained on 6 different networks from STRING. Then we train a separate neural network (NN) for predicting GO term probabilities on each individual feature set.

We will show results from an experiment we conducted on ~18,000 human proteins using their sequences, 3D structures retrieved from PDB and SWISS-MODEL and PPI networks. We will also present the performance results obtained by training a multi-modal NN with all three feature sets for multiple organisms. We show that different modalities contribute to different GO terms, and show that the model integrating information from all sources outperforms the individual models.

MetaGOPlus: Improving Gene Ontology Prediction of Proteins Using Deep Residual Network with Hierarchical Classification
COSI: Function / CAFA 4
  • Yiheng Zhu, 1. Nanjing University of Science and Technology 2. University of Michigan, China
  • Chengxin Zhang, University of Michigan, United States
  • Rucheng Diao, University of Michigan, United States
  • Xiaogen Zhou, University of Michigan, United States
  • Dongjun Yu, Nanjing University of Science and Technology, China
  • Yang Zhang, University of Michigan, United States
  • Peter Freddolino, University of Michigan, United States

Short Abstract: Gene Ontology (GO) has been widely used to annotate the functions of proteins. Accurate identification of GO attributes of proteins provides critical help in understanding their biological activities. We propose a new pipeline, MetaGOPlus, to predict the GO attributes of proteins. MetaGOPlus consists of five sub-pipelines whose predictions are effectively ensembled through logistic regression. These sub-pipelines include a new deep-learning-based model and four sub-pipelines inherited from MetaGO, which use structure alignment, sequence profile comparison, protein-protein interaction, and naïve probability. The newly proposed deep-learning-based sub-pipeline uses a deep residual network (ResNet) as the basic model and extracts multiple sequence-based features as the model input. Moreover, a hierarchical classification layer at the end of the ResNet can fully utilize inter-relationships between GO attributes. The proposed pipeline was tested on a large-scale set of 1000 non-redundant proteins from the CAFA3 experiment. Computational experimental results show that MetaGOPlus achieves better performance than other state-of-the-art function annotation methods. Detailed analyses show that the major advantages of MetaGOPlus stem from two points. First, the new deep-learning-based sub-pipeline can effectively learn hidden relationships between proteins and GO attributes from a large-scale benchmark dataset. Second, the five sub-pipelines can effectively learn the knowledge of GO attributes from different and complementary views.

Predicting Anti-Cancer Peptides Using Deep Neural Networks
COSI: Function / CAFA 4
  • Nathaniel Lane, Montana State University, United States
  • Indika Kahanda, Montana State University, United States

Short Abstract: Chemotherapy, a common treatment for cancer, often comes with debilitating side effects for the patient. Anticancer peptides (ACPs) may offer a promising alternative to traditional chemotherapy. To aid researchers, there has been an interest in the field of machine learning to develop systems that can predict good ACP candidates. Previous works, like MLACP, use the chemical properties of the whole peptide as input features for support vector machine (SVM) or random forest (RF) algorithms. We propose a system called DeepACPpred that uses recurrent neural networks (RNNs) to consider the entire sequence of amino acids. To our knowledge, this is only the second system to use RNNs to address this problem, after ACP-DL. The final design of DeepACPpred combines two approaches. One uses an RNN to consider the whole amino acid sequence, while the other uses a convolutional neural network (CNN) to consider interactions among nearby amino acids. Our experimental results using a range of ACP datasets suggest that DeepACPpred is able to significantly outperform ACP-DL on larger datasets, while being comparable to MLACP performance on the others.

Predicting protein functions with deep learning and multi-source data
COSI: Function / CAFA 4
  • Gabriela Merino, Bioengineering and Bioinformatics Research and Development Institute (IBB-CONICET-UNER), Argentina
  • Rabie Saidi, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • Diego Milone, Research Institute for Signals, Systems and Computational Intelligence (sinc(i)-CONICET-UNL), Argentina
  • Georgina Stegmayer, Research Institute for Signals, Systems and Computational Intelligence (sinc(i)-CONICET-UNL), Argentina
  • María Martin, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom

Short Abstract: Identifying protein functions is crucial in molecular biology. Experimental characterization and manual curation are extremely time-consuming and expensive, and hence cannot cope with the exponential increase of data. Thus, computational methods for automatic function prediction are needed. Although such methods are constantly being developed, their performance still leaves room for improvement.
We propose novel deep learning models for predicting Gene Ontology (GO) annotations integrating multi-source data, represented as protein association matrices. Our models were trained and evaluated on yeast. Input association matrices were based on sequence distance, transcriptomics experiments, GO terms and Reactome annotations. The F-max, commonly used in CAFA challenges, was used for evaluating molecular function (MF), cellular component (CC), and biological process (BP) predictions. Using only sequence distances, F-max values of 0.38, 0.63 and 0.35 were reached for MF, CC and BP, respectively. Considering sequence distances with transcriptomics and GO data improved the F-max to 0.51 for MF. Adding Reactome information to the previous combination allowed it to reach an F-max of 0.65 in CC. Finally, using sequence, GO and Reactome data, the F-max was 0.48 in BP. Our results suggest that deep learning integrating multi-source data is a promising tool for protein function prediction.
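
Since several abstracts on this page report F-max, a minimal sketch of the protein-centric metric may help. It assumes the usual CAFA-style definition: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all benchmark proteins, and F-max is the largest resulting F-measure. The function name, the data layout and the threshold grid are hypothetical, and propagation of terms up the ontology is omitted.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric F-max over a grid of score thresholds.

    predictions: dict protein -> dict of term -> score in [0, 1].
    truth: dict protein -> non-empty set of true terms.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with predictions
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Hypothetical benchmark of two proteins
true_terms = {"P1": {"A", "B"}, "P2": {"C"}}
predicted_scores = {
    "P1": {"A": 0.9, "B": 0.4, "X": 0.2},
    "P2": {"C": 0.8, "Y": 0.6},
}
score = f_max(predicted_scores, true_terms)
print(round(score, 3))
```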

Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function
COSI: Function / CAFA 4
  • Amelia Villegas-Morcillo, University of Granada, Spain
  • Stavros Makrodimitris, Delft University of Technology, Netherlands
  • Roeland C.H.J. van Ham, Delft University of Technology, Netherlands
  • Angel M. Gomez, University of Granada, Spain
  • Victoria Sanchez, University of Granada, Spain
  • Marcel Reinders, Delft University of Technology, Netherlands

Short Abstract: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels is available. In this work, we took an existing deep sequence model that had been pre-trained in an unsupervised setting and applied it to the supervised task of protein function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. It also partly negates the need for deep prediction models, as a two-layer perceptron was enough to achieve state-of-the-art performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not improve performance, hinting that three-dimensional structure is potentially also learned during the unsupervised pre-training.
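As a point of reference for the hand-crafted baselines mentioned above, the sketch below shows one plausible way to compute a one-hot encoding and a k-mer count vector for a protein sequence. The function names, the 20-letter alphabet ordering and the choice of k = 2 are assumptions made for illustration and are not taken from the authors' implementation.

```python
from collections import Counter
from itertools import product
from typing import List

# The 20 standard amino acids; the ordering is an arbitrary choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq: str) -> List[List[int]]:
    """Encode each residue as a 20-dimensional indicator vector.

    Residues outside the standard alphabet map to an all-zero row.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1
        rows.append(row)
    return rows


def kmer_counts(seq: str, k: int = 2) -> List[int]:
    """Count overlapping k-mers, returned in a fixed 20**k order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(AMINO_ACIDS, repeat=k)]


features = kmer_counts("ACAC", k=2)
print(len(features), sum(features))
```

Unlike a learned embedding, both representations are fixed in advance: the one-hot matrix grows with sequence length, while the k-mer vector has a constant length of 20**k regardless of the sequence.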

Update on Protein Functional Annotation in UniProt in 2020
COSI: Function / CAFA 4
  • Hermann Zellner, EMBL, United Kingdom
  • Rabie Saidi, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • María Martin, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), United Kingdom

Short Abstract: With the increasing amount of data generated by sequencing projects, researchers need reliable systems to provide the functional annotation of proteins. UniProtKB uses two functional annotation systems, UniRule and ARBA (Association-Rule-Based Annotator), to automatically annotate UniProtKB/TrEMBL in an efficient and scalable manner with a high degree of accuracy. These systems use protein signatures and taxonomy classifications to infer the biochemical features and biological functions of proteins. This knowledge is expressed in the form of rules: sets of IF-THEN statements coming from expert curation (UniRule [1]) or generated by machine learning (ARBA [2]). Rules are applied at each release, keeping the propagated annotations up to date. On the UniProtKB website, information added by annotation rules is clearly highlighted as such using evidence tags. These tags can also be used as keywords to search for or filter out annotations added by a rule.
The protein function community could also benefit from these rules, as some sequences may not yet be available in public databases or could be present in highly redundant proteomes absent from UniProtKB. This has been made possible via UniFIRE (The UniProt Functional annotation Inference Rule Engine), an engine to execute rules in the URML (UniProt Rule Markup Language) format.
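To illustrate the idea of IF-THEN annotation rules driven by protein signatures and taxonomy, here is a toy sketch. It does not reproduce UniFIRE, the URML format or any real UniRule or ARBA rule: the class names, fields, signature identifiers and annotation texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Protein:
    accession: str
    signatures: FrozenSet[str]  # protein signature hits (hypothetical IDs)
    lineage: FrozenSet[str]     # taxa in the organism's lineage


@dataclass(frozen=True)
class Rule:
    rule_id: str
    required_signatures: FrozenSet[str]
    required_taxon: Optional[str]  # None means no taxonomic constraint
    annotation: str

    def matches(self, protein: Protein) -> bool:
        # IF all required signatures are present AND the taxon constraint holds
        has_signatures = self.required_signatures <= protein.signatures
        in_taxon = (self.required_taxon is None
                    or self.required_taxon in protein.lineage)
        return has_signatures and in_taxon


def annotate(protein: Protein, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Return (annotation, rule_id) pairs so each annotation keeps its evidence."""
    # THEN propagate the annotation, tagged with the rule that produced it.
    return [(r.annotation, r.rule_id) for r in rules if r.matches(protein)]


query = Protein("X1", frozenset({"SIG_A", "SIG_B"}), frozenset({"Bacteria"}))
rules = [
    Rule("R1", frozenset({"SIG_A"}), "Bacteria", "example annotation 1"),
    Rule("R2", frozenset({"SIG_C"}), None, "example annotation 2"),
    Rule("R3", frozenset({"SIG_B"}), None, "example annotation 3"),
]
print(annotate(query, rules))
```

Returning the rule identifier alongside each annotation mirrors the evidence tagging described in the abstract, so rule-derived annotations can later be searched for or filtered out.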